library(mice)
library(dplyr)

Learning Objectives


Introduction

Any missing data method involves modeling assumptions, and all methods have limitations – it is better to avoid missing values, or at least to minimize the problem.

Missing data undermines randomization, the lynchpin of causal inferences in clinical trials.

Longitudinal studies often have drop-outs
– Move out of study catchment area
– Participation becomes too onerous

Some complete-data problems can be formulated with underlying unobserved data, allowing incomplete-data methods of analysis, e.g. EM algorithm
– Factor analysis – multivariate regression with unobserved regressors
– Mixed-effects models – random effects are unobserved “missing” data
– Genetics – genotypes are “missing”

Important to collect and use relevant covariate information
– Covariates related to missingness and main outcomes

Important to perform sensitivity analyses for nonignorable missing data (to be defined later).

Missing data patterns

Diagram illustrating missing data patterns

Missingness Mechanisms

Let \(Y\) be the data matrix. Let \(Y_{obs}\) be the observed components of \(Y\) and let \(Y_{mis}\) be the missing components of \(Y\). Let \(R\) be a missing data indicator matrix whose \((i,j)\)th element is 1 if the corresponding element of \(Y\) is missing and 0 if it is observed. In other words, \(R\) replaces each element of \(Y\) with a 0 or a 1 according to whether that element is observed or missing.

The missing data pattern concerns the distribution of \(R\) whereas the missing data mechanism concerns the distribution of \(R\) given \(Y\). That is, the missing data pattern concerns which data are missing whereas the missing data mechanism concerns why the data are missing.

Missing Completely at Random

MCAR if missingness independent of \(Y\)
\(P(R|Y) = P(R)\) for all \(Y\)

Missing data are missing completely at random (MCAR) if the missing data (and hence the observed data) can be regarded as a simple random sample of the complete data.

The probability that a data value is missing (missingness) is unrelated to the data value itself or to any other value, missing or observed, in the data set.

In the case of dropout, MCAR means that the probability of dropout is unrelated to any characteristics of the subjects.

In the case of missed occasions, MCAR means that the probability of missed occasions is unrelated to any characteristics of the subjects.

Missing at Random

MAR if missingness only depends on observed components \(Y_{obs}\) of \(Y\)
\(P(R|Y) = P(R|Y_{obs})\) for all \(Y\)

If missingness is related to the observed data but - conditioning on the observed data - not to the missing data, then data are missing at random (MAR).

In the case of dropout, MAR means that the probability of dropout may be related to observed covariates and pre-dropout responses (i.e., MAR if dropout depends on values recorded prior to drop-out).

MCAR is a stronger condition and a special case of MAR.

Missing Not at Random

MNAR if missingness depends on missing (as well as perhaps on observed) components of \(Y\).

If missingness is related to the missing values themselves even when the information in the observed data is taken into account, then missing data are missing not at random (MNAR).

  • If, conditional on all of the observed data, individuals with higher incomes are more likely than others to withhold information about their incomes, then the missing income data are MNAR.

  • In the case of dropout, MNAR means that the probability of dropout is related to responses at the time of dropout and possibly afterward (i.e., dropout depends on values that are missing – those after drop-out).

  • In the case of missed occasions, the missingness probability is a function of the subject’s state (e.g., a nurse cannot take a measurement because the patient is too sick in a trial assessing a drug’s effectiveness).
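The three mechanisms can be illustrated by simulation. Below is a minimal sketch in R (all variable names are for illustration): x is always observed, and y is made missing under each mechanism. Missingness indicators follow the convention above (1 = missing).

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# MCAR: missingness is unrelated to x or y
r_mcar <- rbinom(n, 1, 0.3)

# MAR: missingness depends only on the observed x
r_mar <- rbinom(n, 1, plogis(-1 + 2 * x))

# MNAR: missingness depends on the (possibly unobserved) y itself
r_mnar <- rbinom(n, 1, plogis(-1 + 2 * y))

y_mcar <- ifelse(r_mcar == 1, NA, y)
y_mar  <- ifelse(r_mar  == 1, NA, y)
y_mnar <- ifelse(r_mnar == 1, NA, y)

# The observed-case mean of y is roughly unbiased under MCAR,
# but biased under MAR and MNAR:
c(full = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar,  na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE))
```

Note that under MAR the complete-case mean of y is still biased, because missingness depends on x, which is correlated with y; MAR only guarantees valid inference once we condition on the observed data.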

Ignorability

If the data are MCAR or MAR then it is not necessary to model the process that generates the missing data in order to accommodate the missing data. The mechanism that produces the missing data is ignorable.

When data are MNAR, the missing-data mechanism is non-ignorable. It is necessary to model this mechanism to deal with the missing data in a valid manner.

Except in some special situations (e.g., when missing data are missing by design), it is not possible to know whether data are MCAR, MAR, or MNAR.

Showing that missingness on some variable in a data set is related to observed data on other variables proves that the missing data are not MCAR.

Demonstrating that missingness in a variable is not related to observed data in other variables does not prove that the missing data are MCAR. Non-respondents may be differentiated from respondents in some unobserved manner.

Generally Inappropriate Strategies for Dealing with Missing Data

Complete-case analysis

In complete-case analysis, only observations containing valid data values on every variable are retained for further analysis. Practically, this involves deleting any row with one or more missing values, and is also known as listwise, or case-wise, deletion. Deleting all observations with missing data can reduce statistical power by reducing the available sample size.
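In base R, listwise deletion is what na.omit() does (and what lm() applies by default via na.action = na.omit). A toy example:

```r
# Listwise (case-wise) deletion: na.omit() drops every row that has
# at least one missing value.
dat <- data.frame(x = c(1, 2, NA, 4),
                  y = c(10, NA, 30, 40))
cc <- na.omit(dat)
nrow(dat)  # 4 rows before deletion
nrow(cc)   # 2 rows remain: only the fully observed cases
```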

Loss of information in incomplete cases has two aspects:
– Increased variance of estimates
– Bias when complete cases differ systematically from incomplete cases (i.e., respondents may differ from nonrespondents)

Complete-case analysis is biased if drop-outs differ systematically from those who remain.

Assumption: MCAR; that is, provides consistent estimates and valid inferences only when the missing data are MCAR.

Generally inappropriate.

Pairwise deletion

Pairwise deletion (also called available-case analysis) is often considered an alternative to listwise deletion when working with datasets that have missing values. In pairwise deletion, observations are deleted only if they are missing data for the variables involved in a specific analysis.

Although pairwise deletion appears to use all available data, each calculation is in fact based on a different subset of the data. This can lead to distorted and difficult-to-interpret results.

By basing different statistics on different subsets of the data, available-case analysis can lead to nonsensical results, such as estimated correlations outside the range from -1 to +1, or covariance matrices that are not positive definite.
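Base R's cor() illustrates the difference via its use argument (a toy example; with only a handful of rows, the listwise version here is computed from just two complete cases):

```r
# Pairwise (available-case) vs. listwise correlations in base R.
dat <- data.frame(x = c(1, 2, 3, NA, 5),
                  y = c(2, 1, NA, 4, 5),
                  z = c(NA, 2, 3, 4, 1))
# Each entry below is computed from a different subset of the rows:
cor(dat, use = "pairwise.complete.obs")
# Listwise deletion uses only the rows with no missing values at all:
cor(dat, use = "complete.obs")
```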

I recommend staying away from this approach. I only mention it because people still do it.

Last observation carried forward

LOCF imputation for repeated measures with dropouts:
– impute using the last recorded value
– implicit model: values are unchanged after drop‐out
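A minimal base-R sketch of LOCF (packages such as zoo, via na.locf(), and tidyr, via fill(), provide the same operation):

```r
# Carry the last observed value forward within a subject's series.
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}
locf(c(5, 7, NA, NA, 9, NA))  # 5 7 7 7 9 9
```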

LOCF is mistakenly considered to be valid under MCAR or MAR, but it is generally inappropriate.

Unconditional mean imputation

In unconditional mean imputation, the missing values in a variable are replaced with the mean of that variable (missing values of categorical variables can be replaced by the mode for that variable).

Using mean substitution is likely to underestimate standard errors, distort correlations among variables, and produce incorrect p-values in statistical tests. Mean imputation preserves the means of variables, but it makes their distributions less variable and tends to weaken relationships between variables. It produces biased results for data that are not MCAR.

By treating the missing data as if they were observed, mean imputation exaggerates the effective size of the data set, further distorting statistical inference - a deficiency that it shares with other single imputation methods. The substitution is nonstochastic, meaning that random error is not introduced (unlike with multiple imputation).
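The variance shrinkage is easy to see in a toy example:

```r
# Unconditional mean imputation in base R, and the variance shrinkage
# it causes.
x <- c(2, 4, NA, 8, NA, 10)
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
var(x, na.rm = TRUE)  # 13.33: variance of the observed values
var(x_imp)            # 8: smaller, because imputed values add no spread
```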

I recommend avoiding this approach for most missing-data problems.

Conditional mean imputation

Conditional-mean imputation replaces missing data with predicted values, obtained, for example, from a regression equation (regression imputation).

The imputed observations tend to be less variable than real data, because they lack residual variation.

Another problem is that we have failed to account for uncertainty in the estimation of the regression coefficients used to obtain the imputed values.

Regression imputation improves on unconditional mean imputation, but it still produces biased results for data that are not MCAR.

The first problem with regression imputation (removal of residual variation) can be addressed by adding a randomly sampled residual to each imputed value.

The second problem (uncertainty in the regression coefficients used for prediction of missing data) leads naturally to Bayesian multiple imputation of missing values.
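The first fix can be sketched in a few lines of R (a simulation for illustration; in practice the coefficient uncertainty is handled by drawing the coefficients from their posterior, which is what Bayesian multiple imputation does):

```r
# Stochastic regression imputation sketch: predict missing values from a
# regression fitted to complete cases, then add a randomly sampled residual.
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 50)] <- NA                  # make some y values missing

fit  <- lm(y ~ x)                       # fitted on the complete cases
miss <- is.na(y)
pred <- predict(fit, newdata = data.frame(x = x[miss]))

y_det   <- y; y_det[miss]   <- pred     # deterministic: too smooth
y_stoch <- y
y_stoch[miss] <- pred + sample(residuals(fit), sum(miss), replace = TRUE)

# Deterministic imputation shrinks the variance; adding residuals restores it:
c(deterministic = var(y_det), stochastic = var(y_stoch))
```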

Principled Approaches to Missing Data

Inverse Probability Weighting

Assign a missingness weight to the complete cases to make them more representative of all cases. Standard errors can be estimated analytically or using the bootstrap.

This approach is primarily used for dropout (i.e., the univariate or monotone missingness pattern). It is not as useful for the arbitrary missingness pattern.
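A sketch of the weighting step under MAR dropout (simulated data; all variable names are illustrative):

```r
# IPW sketch: model the probability of being a complete case, then weight
# complete cases by the inverse of that estimated probability.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- 1 + x + rnorm(n)
observed <- rbinom(n, 1, plogis(1 - x)) == 1  # MAR: dropout depends on x
y[!observed] <- NA

ps <- glm(observed ~ x, family = binomial)    # missingness model
w  <- 1 / fitted(ps)[observed]                # weights for complete cases

weighted.mean(y[observed], w)   # weighted complete-case estimate of E(y)
mean(y, na.rm = TRUE)           # unweighted complete-case mean, biased under MAR
```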

Analyze the incomplete data

Methods can be applied to incomplete data – that is, there are methods that do not require rectangular (or balanced) data sets. This is often accomplished using full information maximum likelihood (FIML) and the EM (Expectation Maximization) algorithm.

Likelihood inference ignoring the missing data mechanism is valid if
– Model for Y is correctly specified
– Mechanism is MAR (or MCAR)

Missing Data in Mixed Models

Mixed-effects models automatically handle unbalanced data (i.e., subjects need not have the same number or timing of responses). So if some responses for some subjects are missing, we may omit the missed occasions and apply ML or REML to the reduced data; this is appropriate if the missing responses are MAR.
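Assuming the lme4 package is installed, this can be seen with its sleepstudy data: setting some outcomes to NA simply causes those rows to be dropped before fitting, and the likelihood analysis of the remaining rows is valid under MAR.

```r
library(lme4)

# Make some outcomes missing; lmer() drops those rows (na.action = na.omit)
# and maximizes the likelihood of the remaining observations.
set.seed(42)
sleep <- sleepstudy
sleep$Reaction[sample(nrow(sleep), 30)] <- NA
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleep)
nobs(fit)  # 150 of the original 180 rows are used
```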

If subjects drop out due to death, then ML or REML estimates may be interpreted either as

  1. asymptotically unbiased for the population of live subjects, or

  2. asymptotically unbiased for the population of all subjects, living or dead, assuming that death is MAR or unrelated to the (hypothetical) values that would have been seen had the subjects not died.

This approach works if there is missingness only on the outcome variable. This approach cannot handle missingness on the predictor variables.

If there is missingness on covariates, omitting subjects or occasions with missingness on these covariates implicitly assumes MCAR.

Missing Data in Marginal Models

Recall that marginal models are not fit using likelihood methods.

GEE is semiparametric; it assumes that the mean structure (link function and covariates) is specified correctly, but makes no further assumptions about the distribution of the response.

If the working assumption regarding the correlation structure is wrong, GEE and sandwich estimators are consistent under MCAR. If the working assumption regarding the correlation structure is correct, the GEE estimator is consistent under MAR (but the sandwich estimator of the standard errors may not be). By relaxing assumptions on the correlation structure, we have to make unrealistic assumptions regarding the missingness mechanism.

So basically, marginal models assume that the missingness is MCAR.

Multiple Imputation (MI)

Bayesian multiple imputation (MI) is a flexible and general method for dealing with data that are missing at random.

There are uncertainties associated with imputation and this uncertainty can be estimated by using MI. MI propagates imputation uncertainty and is more efficient than a single imputation.

The essential idea of multiple imputation is to reflect the uncertainty associated with missing data by imputing several values for each missing value, each imputed value drawn from the predictive distribution (i.e., from an imputation model) of the missing data, and therefore produce not one but several completed data sets.

Standard methods of statistical analysis are then applied in parallel to the completed data sets (analysis model).

Parameters of interest are estimated along with their standard errors for each imputed data set.

After imputation, the analysis is performed on the \(M\) multiply imputed data sets and the estimates are combined using Rubin’s Rules.

MI workflow

Estimated parameters are averaged across completed data sets.

The MI estimate is \(\bar \theta_M = \frac{1}{M} \sum_{m=1}^{M} \widehat \theta_m\) where \(\theta\) is the parameter of interest and \(\widehat{\theta}_m\) is the estimate from the \(m\)th data set for \(m=1,...,M\).

Standard errors are combined across imputed data sets, taking into account the variation among the estimates in the several data sets, thereby capturing the added uncertainty due to having to impute the missing data.

Let \(W_m\) be the estimated sampling variance (squared standard error) of \(\widehat \theta_m\) from the \(m\)th data set, and let \(\bar W_M\) be the average within-imputation variance over the \(M\) imputed data sets. The MI estimate of the total variance is \(T_M = \bar W_M + (1 + 1/M)B_M\) where \(B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\widehat \theta_m - \bar \theta_M)^2\) is the between-imputation variance.
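These rules are simple enough to code directly. A sketch (in practice mice::pool() does this, including the degrees-of-freedom calculation omitted here):

```r
# Rubin's rules: qhat is a vector of M point estimates, u is the vector of
# their squared standard errors (within-imputation variances).
pool_rubin <- function(qhat, u) {
  M      <- length(qhat)
  qbar   <- mean(qhat)                 # pooled point estimate
  Wbar   <- mean(u)                    # within-imputation variance
  B      <- var(qhat)                  # between-imputation variance
  Tvar   <- Wbar + (1 + 1 / M) * B     # total variance
  lambda <- (1 + 1 / M) * B / Tvar     # proportion of variance due to missingness
  c(estimate = qbar, se = sqrt(Tvar), lambda = lambda)
}
pool_rubin(qhat = c(5.1, 4.8, 5.3), u = c(0.20, 0.25, 0.22))
```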


The estimated fraction of missing information (fmi):
\(fmi = \frac{(1 + 1/M)B_M}{(1 + 1/M)B_M + \bar W_M}\)

The proportion of the total variance that is due to the missing data is \(\lambda = \frac{(1 + 1/M)B_M}{T_M}\), which coincides with the fmi formula above.

The efficiency of the MI estimator relative to the maximally efficient maximum-likelihood estimator is approximately \(RE = \left(1 + \frac{fmi}{M}\right)^{-1} = \frac{M}{M + fmi}\).

If the number of imputations is very large, MI is as efficient as ML.

Even when the rate of missing information is high and the number of imputations modest, the relative efficiency of the MI estimator hardly suffers.
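Plugging numbers into \(RE = (1 + fmi/M)^{-1}\) makes this concrete:

```r
# Relative efficiency of MI with M imputations versus M = infinity,
# tabulated for a few values of M (rows) and fmi (columns).
re <- function(M, fmi) 1 / (1 + fmi / M)
round(outer(c(5, 10, 20), c(0.1, 0.3, 0.5), re), 3)
# Even with fmi = 0.5, M = 5 imputations already achieve about 91% efficiency.
```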


An important advantage is that the imputation model and analysis model can differ. In particular, auxiliary variables, \(V\), that are not included in the analysis model can be used in the imputation model, so that the MAR assumption is more plausible.

Marked incompatibility between the analysis model and the imputation model may be a concern. Any variable that will be included in the analysis model should also be included in the imputation model.

Multiple imputation cannot preserve features of the data that are not represented in the imputation model.

It is important to ensure that the imputation model is consistent with the intended analysis.

Try to include variables in the imputation model that make the assumption of ignorable missingness reasonable.

Finding variables that are highly correlated with a variable that has missing data will likely improve the quality of imputations, as will variables that are related to missingness.

Use all relevant variables, even ones not used in the analysis model.

There is nothing wrong in using the response variable to help impute missing values of explanatory variables. Think of imputation as a pure prediction problem.

Make sure that the imputation model captures relevant features of the data. For example, imputations will not preserve nonlinear relationships and interactions among the variables, unless we make special provision for these features of the data.
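For example, mice supports "passive imputation" for derived terms. The sketch below, using the nhanes data and a hypothetical squared-bmi column named bmi2, keeps the derived variable consistent with the imputed values of its source:

```r
library(mice)

# Passive imputation sketch: recompute a derived squared term from the
# imputed values of its source variable instead of imputing it directly.
dat2 <- nhanes
dat2$bmi2 <- dat2$bmi^2                 # derived variable (hypothetical name)

ini  <- mice(dat2, maxit = 0, print = FALSE)   # dry run to get the defaults
meth <- ini$method
meth["bmi2"] <- "~ I(bmi^2)"            # recompute bmi2 from imputed bmi
pred <- ini$predictorMatrix
pred["bmi", "bmi2"] <- 0                # do not let bmi2 help impute bmi

imp_sq <- mice(dat2, method = meth, predictorMatrix = pred,
               m = 5, maxit = 5, seed = 1, print = FALSE)
```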

Joint vs. Conditional Distribution Imputation Strategies

Conditional imputation allows specifying an individual regression model for each type of variable. This approach is called fully conditional specification (FCS) or multiple imputation by chained equations (MICE).

  • Types of variables
    • Continuous (Normal)
    • Categorical (Logistic or generalized logistic)
    • Count (Poisson)
    • Mixed or semi-continuous (Logistic/Normal)
    • Ordinal (Ordered probit)
  • Parametric or semi-parametric regression models
  • Restrictions
    • Regression model is fitted only to the relevant subset
  • Bounds
    • Draws from a truncated distribution from the corresponding regression model
  • Models each conditional distribution. There is no guarantee that a joint distribution exists with these conditional distributions.
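In mice, this corresponds to the method vector: one univariate imputation model per incomplete variable. A sketch using the nhanes2 data (a version of nhanes shipped with mice in which age and hyp are factors):

```r
library(mice)

# make.method() picks a default method per variable type: pmm for numeric,
# logreg for binary factors, polyreg for nominal factors, polr for ordered
# factors; fully observed variables get "".
meth <- make.method(nhanes2)
meth
# The defaults can be overridden per variable, e.g. Bayesian linear
# regression instead of predictive mean matching for bmi:
meth["bmi"] <- "norm"
imp_fcs <- mice(nhanes2, method = meth, m = 5, maxit = 5,
                seed = 1, print = FALSE)
```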

Special Considerations for Multilevel Data

The mixed-effects model is ideal for missingness in the outcome: assuming the missingness is MAR, it poses no special challenges. However, missingness in predictor variables does pose a challenge, and is complicated in that it may occur in either level-1 (e.g., time-varying) predictors or level-2 (e.g., time-invariant) predictors. The default in most software is to delete the entire cluster if a level-2 predictor is missing. For example, if hospitals are the level-2 unit and the level-2 variable hospital type (e.g., trauma center vs. not) is missing for a given hospital, then all observations in that hospital are deleted from the analysis. For longitudinal data, if a level-2 variable, such as gender, is missing, then all of that individual’s observations are deleted.

Ignoring the clustering and imputing the data with a single-level method will underestimate the ICC and this underestimation will grow as the ICC and amount of missing data increase. Single-level imputation should be avoided unless only a few cases contain missing data (e.g., less than 5%) and the ICC is low (e.g., less than .10).

Another ad-hoc technique is to add a dummy variable for each cluster, so that the model estimates a separate coefficient for each cluster. If the primary interest is on the fixed effects, adding a cluster dummy is an easily implementable alternative, unless the missing rate is very large and/or the ICC is very low and the number of records in the cluster is small. However, the bias in random slopes and variance components can be substantial.

Derived variables, such as cluster-level means of level-1 variables, interaction terms, and dummy-coded versions of categorical variables, do not themselves need to be imputed because they can be recalculated from the imputed variables from which they are derived. However, the imputation model needs to be aware of, and account for, the role that such derived variables play in the complete-data model.

Example of MI in R using mice

We will use the NHANES example from the mice package. It is a small data set with 4 variables and 25 observations that was extracted from the much larger NHANES data. It is automatically loaded with the mice package, so you can refer to it without ‘reading’ it in. age is grouped such that 1=20-39, 2=40-59, 3=60+. chl is measured in mg/dL and hyp is 1=no, 2=yes. bmi is body mass index.

Here again is our workflow, but with the R functions listed.

MI workflow

Our goal is to fit a regression model of cholesterol on age and bmi. But as we see below we have some missing values. If we fit a regression model, 12 observations (nearly half of our sample!) are deleted.

summary(nhanes)
##       age            bmi             hyp             chl       
##  Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
##  1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
##  Median :2.00   Median :26.75   Median :1.000   Median :187.0  
##  Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
##  3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
##  Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
##                 NA's   :9       NA's   :8       NA's   :10
summary(lm(chl ~ age + bmi, data=nhanes))
## 
## Call:
## lm(formula = chl ~ age + bmi, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.187 -19.517  -0.310   6.915  60.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -80.194     58.772  -1.364 0.202327    
## age           53.069     11.293   4.699 0.000842 ***
## bmi            6.884      1.846   3.730 0.003913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.6781 
## F-statistic: 13.64 on 2 and 10 DF,  p-value: 0.001388

The first step is to examine the missingness.

Examine missing data patterns

md.pattern(nhanes)  # from mice package

##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27
md.pairs(nhanes)    # from mice package
## $rr
##     age bmi hyp chl
## age  25  16  17  15
## bmi  16  16  16  13
## hyp  17  16  17  14
## chl  15  13  14  15
## 
## $rm
##     age bmi hyp chl
## age   0   9   8  10
## bmi   0   0   0   3
## hyp   0   1   0   3
## chl   0   2   1   0
## 
## $mr
##     age bmi hyp chl
## age   0   0   0   0
## bmi   9   0   1   2
## hyp   8   0   0   1
## chl  10   3   3   0
## 
## $mm
##     age bmi hyp chl
## age   0   0   0   0
## bmi   0   9   8   7
## hyp   0   8   8   7
## chl   0   7   7  10

Perform the multiple imputation

imp <- mice(nhanes, m=20, maxit=25, seed=42, print=FALSE)

You should make sure that the imputations are plausible. For example, you do not want negative BMIs.

imp$imp$bmi
##       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 1  35.3 22.0 20.4 27.2 29.6 27.2 30.1 28.7 28.7 27.2 30.1 27.5 33.2 27.2 33.2
## 3  22.0 26.3 30.1 25.5 22.0 33.2 29.6 33.2 26.3 22.0 28.7 27.4 33.2 30.1 28.7
## 4  21.7 27.4 21.7 27.4 35.3 27.4 21.7 27.5 27.4 20.4 25.5 20.4 21.7 22.5 21.7
## 6  21.7 25.5 20.4 21.7 25.5 25.5 27.4 25.5 21.7 25.5 20.4 21.7 27.4 22.5 21.7
## 10 28.7 20.4 30.1 22.0 22.5 26.3 21.7 27.4 22.7 27.4 27.5 20.4 24.9 29.6 28.7
## 11 22.0 28.7 22.5 27.5 27.4 33.2 27.4 33.2 26.3 35.3 20.4 29.6 29.6 26.3 25.5
## 12 27.2 26.3 27.2 25.5 25.5 29.6 22.7 24.9 35.3 27.5 27.2 25.5 24.9 27.2 28.7
## 16 20.4 35.3 22.5 27.2 29.6 30.1 33.2 33.2 28.7 27.2 27.4 22.7 22.5 22.0 24.9
## 21 28.7 28.7 22.0 28.7 27.4 33.2 33.2 30.1 28.7 30.1 28.7 28.7 30.1 33.2 26.3
##      16   17   18   19   20
## 1  26.3 22.0 22.7 26.3 35.3
## 3  22.0 27.5 27.2 22.0 28.7
## 4  24.9 22.5 35.3 28.7 20.4
## 6  20.4 27.4 22.5 27.5 20.4
## 10 25.5 20.4 22.0 22.0 22.5
## 11 27.2 27.5 29.6 27.2 27.2
## 12 20.4 24.9 26.3 28.7 22.0
## 16 33.2 30.1 29.6 29.6 21.7
## 21 28.7 21.7 20.4 29.6 27.2

Visualize this by overlaying the distribution for each of the 20 imputations on the observed data distribution.

densityplot(imp)

You can obtain a completed data set using complete(), which by default gives you the first imputation. If you wanted the second imputation, you would use complete(imp, 2).

mice::complete(imp)
##    age  bmi hyp chl
## 1    1 35.3   2 204
## 2    2 22.7   1 187
## 3    1 22.0   1 187
## 4    3 21.7   2 204
## 5    1 20.4   1 113
## 6    3 21.7   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 28.7   1 184
## 11   1 22.0   1 131
## 12   2 27.2   1 186
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 229
## 16   1 20.4   1 187
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 184
## 21   1 28.7   1 238
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 186
## 25   2 27.4   1 186
nhanes
##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

You should also check for convergence.

plot(imp)

Fit the analysis model to each of the imputed data sets

fit <- with(imp, lm(chl ~ age + bmi))

Pool (i.e., combine) the results across the imputed data sets

summary(mice::pool(fit))
##              estimate std.error  statistic        df    p.value
## (Intercept) -7.936535 72.919830 -0.1088392 10.967551 0.91529581
## age         32.019372 12.366584  2.5891850  9.261776 0.02860714
## bmi          5.466810  2.316136  2.3603145 11.334229 0.03716714
pool.r.squared(fit)
##           est      lo 95     hi 95 fmi
## R^2 0.4432556 0.05766533 0.7685971 NaN

Checking diagnostics

The example above was well-behaved. Here is a pathological example.

helpmiss <- read.csv("data/helpmiss.csv", header = TRUE)
dat <- dplyr::select(helpmiss, homeless, pcs, mcs, cesd, indtot, pss_fr, drugrisk, satreat, drinkstatus, daysdrink, female, age, treat, anysubstatus, daysanysub)
imp <- mice(dat, m=25, maxit=25, seed=42, print=FALSE)
plot(imp)

densityplot(imp)

Non-ignorable missingness

Nonignorable mechanisms can be included in a missing‐data analysis but this is a difficult modeling problem. Software for fitting non‐ignorable models is not widely available. Design to avoid nonignorable missing data is preferable if possible. Two generic modeling strategies are pattern-mixture models and selection models.

Selection models are:
- more natural substantive model formulation if inference concerns the entire population
- more common approach in literature
- sensitive to specification of the form of the missing‐data mechanism, which is often not well understood

Pattern‐mixture models:
- are more natural when interest is in population strata defined by missing‐data pattern
- are closer to the form of the data and sometimes simpler to fit
- can avoid specifying the form of the missing data mechanism, which is incorporated indirectly via parameter restrictions

But often little is known about the missing‐data mechanism, and results may be sensitive to the model formulation. The parameters governing the missing‐data mechanism are often unidentified, or only weakly identified, from the data – identifiability requires structural assumptions that are often questionable. As a result, it is usually more appropriate to perform a sensitivity analysis, fixing the weakly identified parameters at a range of plausible values. In many (though not all) situations, it is reasonable to choose an MAR primary model and examine MNAR models via a sensitivity analysis to assess plausible deviations from MAR.

With substantial missing data, sensitivity analyses to assess robustness to alternative analysis models are needed. Sensitivity analysis is a scientific way of attempting to reflect uncertainty arising from potentially MNAR missing data. Deciding on how to implement and interpret a sensitivity analysis is challenging. The need and importance of sensitivity analysis increases with the amount of potentially MNAR missing data. This reinforces the need to limit missing data in the design and implementation stage. Avoiding substantial amounts of missing data is key!